Setting the options for knitr…
library(knitr)
knitr::opts_chunk$set(results = 'asis', # Can also be set at the chunk-level
echo = TRUE,
comment = NA,
prompt = FALSE,
warning = FALSE,
message = FALSE,
fig.align = "center",
fig.width = 8.125,
out.width = "100%",
fig.path = "D:/Figures/figures_sumap/",
dev = c('png', 'tiff'),
dev.args = list(png = list(type = "cairo"), tiff = list(compression = 'lzw')),
cache = TRUE,
cache.path = "D:/cache/cache_sumap/",
autodep = TRUE)
options(width = 1000, scipen = 999, knitr.kable.NA = '')Loading libraries…
library(dplyr)
library(tidyverse) # The tidy verse
library(gridExtra)
library(uwot)
#require(devtools)
#install_version("clues", version = "0.6.2.2", repos = "http://cran.us.r-project.org")
library(clues)
library(parallel)
library(plotly)
library(splitstackshape)Cleaning the memory…
rm(list = ls(all.names = TRUE))Loading previously defined objects (dataset, sets of predictors…
load(file = "Study_objects.RData")We define the function get_umap_and_silhouette() which
takes as input a data frame (or tibble) which contains each raw initial
features from the three sets (Bioacoustic,
DCT, MFCC) and target labels
(individual or type). The target parameter
target_DV that specifies the supervised target information
passes this target column to the UMAP model during fitting and it will
make use of it to perform a supervised dimension reduction (S-UMAP).
The umap() function in the uwot package
includes several others parameters to build a low-dimensional graph. The
four main parameters are n_neighbors which determines the
number of neighbors used in constructing the nearest neighbor graph,
min_dist which determines the allowed separation between
connected embeddings, n_components which is the
dimensionality of the embedding space, and metric which
defines the distance metric (e.g., Euclidean, cosine) used to define the
distances between points in the high-dimensional data space. We used the
default settings for each, i.e., 100 neighbors, a minimal distance of
0.01, two dimensions and the Euclidean distance metric.
UMAP is a stochastic algorithm. This means that different runs of
UMAP may have variations. The function performs the dimensional
reduction n times (with n = 100 by default) with different
random seeds to compute a distribution of silhouette scores for the
various observations. For each repetition, in order to estimate the
quality of the partitioning provided by UMAP and thus quantify the
degree of clustering of call types and individual signatures, we compute
the values of the silhouette coefficients (Rousseeuw, 1987). We use to
this end function get_Silhouette in the clues
package.
The function returns both a list of SUMAP-reduced coordinates and a list of silhouette scores for the calls (for the n repetitions).
get_umap_and_silhouette <- function(df_umap, target_DV, target_algo, target_set, n_neighbors = 100, metric = "euclidean", min_dist = 0.01, n = 100) {
df_umap <- df_umap %>%
filter(algo == target_algo, set == target_set) %>%
dplyr::select(-algo, -set)
if (target_DV == "individual")
other_V <- "type"
if (target_DV == "type")
other_V <- "individual"
m <- df_umap %>%
dplyr::select(-id, -individual, -type, -sequence)
umap_list <- vector(mode='list', length = n)
sil_list <- vector(mode='list', length = n)
for (i in 1:n) {
set.seed(i)
umap_coords <- m %>%
umap(n_neighbors = n_neighbors, n_components = 2, metric = "euclidean", min_dist = 0.01,
n_threads = detectCores() - 2, y = df_umap[[target_DV]]) %>%
data.frame()
umap_coords[[target_DV]] <- df_umap[[target_DV]]
umap_coords[[other_V]] <- df_umap[[other_V]]
umap_coords$sequence <- df_umap$sequence
silhouette <- get_Silhouette(as.matrix(umap_coords[, 1:2]), umap_coords[[target_DV]], disMethod = "Euclidean")
silhouette.df <- data.frame(sil = silhouette$s,
observed = umap_coords[[target_DV]],
id = df_umap %>% pull(id))
umap_list[[i]] <- umap_coords %>% mutate(index = i)
sil_list[[i]] <- silhouette.df %>% mutate(index = i)
}
return (list(umap_list = umap_list, sil_list = sil_list))
}We also define the function get_sil_plot() which takes
as input a data table which contains silhouette scores for all our calls
and for a number of repetitions (referred to with the variable index in
the data table) (see above) We first compute the average silhouette
scores of the calls overt the repetitions, and order the table in order
to visualize things properly. We also first compute a grand average for
each level of the target variable in the initial SUMAP, then the
standard deviation of the average score of each level over the
repetitions. The values are then displayed in the figure.
get_sil_plot <- function(dis.silhouette.df) {
silhouette.df <- dis.silhouette.df %>%
group_by(observed, id) %>%
summarize(sil = mean(sil)) %>%
ungroup() %>%
arrange(observed, -sil) %>%
mutate(name = as.factor(1:nrow(.)))
means.per.group <- silhouette.df %>%
mutate(name = as.integer(name)) %>%
group_by(observed) %>%
summarize(mean.group = round(mean(sil), 2), mean.name = mean(name), min = name[which.min(name)], max = name[which.max(name)])
sd.per.group <- dis.silhouette.df %>%
group_by(index, observed) %>%
summarize(sil = mean(sil)) %>%
ungroup() %>%
group_by(observed) %>%
summarize(sd.group = round(sd(sil), 3))
means.per.group <- means.per.group %>% inner_join(sd.per.group)
p <- silhouette.df %>%
ggplot(., aes(x = name, y = sil, color = observed, fill = observed)) +
geom_col() +
labs(y = "Silhouette value",
x = "",
fill="Cluster",
color="Cluster",
title = paste0("Silhouette plot - ", "Raw features", "\n", "Bioacoustic, DCT, MFCC"),
subtitle = paste0("Average silhouette value: ", round(mean(silhouette.df$sil), 2)) ) +
scale_color_brewer(palette = "Paired") +
scale_fill_brewer(palette = "Paired") +
ggplot2::ylim(c(-1.1, 1.1)) +
geom_text(data = means.per.group, aes(x = mean.name, y = mean.group, label = mean.group), color="black", hjust = -0.2, vjust = -0.3) +
geom_segment(data = means.per.group, aes(x = min - 0.5, xend = max + 0.5, y = mean.group, yend = mean.group), size = 0.75, linetype = "dashed", color="black") +
geom_segment(data = means.per.group, aes(x = (min + max) * 0.5, xend = (min + max) * 0.5, y = mean.group - 1.96 * sd.group, yend = mean.group + 1.96 * sd.group), size = 0.5, linetype = "dashed", color="black") +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
coord_flip() +
guides(fill="none", color="none")
return (p)
}The function get_umap_plot() takes the SUMAP-reduced
coordinates of the calls and the target variable for the supervised
dimensionality reduction, and returns a scatter plot.
get_umap_plot <- function(umap_df, target_DV, title, size = 1, alpha = 0.5) {
p_umap <- umap_df %>% ggplot(aes(x = X1, y = X2, col = .data[[target_DV]])) +
geom_point(size = size, alpha = alpha) +
scale_color_brewer(palette = "Paired") +
guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1.0))) +
ggtitle(title)
return (p_umap)
}The function get_umap_plot() takes the SUMAP-reduced
coordinates of the calls and the target variable for the supervised
dimensionality reduction, and returns one scatter plot per level of the
variable complementary to the target variable, i.e.,
type for individual, and
individual for type.
get_umap_plot_decomposed <- function(umap_df, target_DV, title, size = 1, alpha = 0.5) {
if (target_DV == "individual")
other_V <- "type"
if (target_DV == "type")
other_V <- "individual"
p_umap <- umap_df %>% ggplot(aes(x = X1, y = X2, col = .data[[target_DV]])) +
geom_point(size = size, alpha = alpha) +
scale_color_brewer(palette = "Paired") +
guides(colour = guide_legend(override.aes = list(size = 2, alpha = 1.0))) +
ggtitle(title) +
facet_wrap(vars(.data[[other_V]]))
return (p_umap)
}We call the function get_umap_and_silhouette() to
collect results for type with all the predictors
contained in our three main sets.
target_columns <- c("id", "individual", "type", "sequence", Bioacoustic_set, DCT_set, MFCC_set) %>% unique()
df_umap <- df %>%
select(one_of(target_columns)) %>%
mutate(algo = "Raw features", set = "Bioacoustic, DCT, MFCC")
results_type <- df_umap %>% get_umap_and_silhouette("type", "Raw features", "Bioacoustic, DCT, MFCC", n = 100)We concatenate the elements of the list of tables silhouette scores into a single table.
distrib_silhouette_df_type <- do.call(rbind, results_type$sil_list)umap_type <- results_type$umap_list[[1]] %>% get_umap_plot("type", "UMAP - Raw features\nBioacoustic, DCT, MFCC")
umap_typeresults_type$umap_list[[1]] %>% plot_ly(x = ~X1, y = ~X2, color = ~type, colors = "Paired", type = 'scatter', mode = 'markers',
hoverinfo = "text",
hovertext = paste("Type:", results_type$umap_list[[1]]$type,
"<br>Individual:", results_type$umap_list[[1]]$individual,
"<br>Sequence:", results_type$umap_list[[1]]$sequence)) %>%
layout(
plot_bgcolor = "#ffffff",
legend=list(title=list(text='Type')),
xaxis = list(
title = "0"),
yaxis = list(
title = "1")) umap_decomposed_type <- results_type$umap_list[[1]] %>% get_umap_plot_decomposed("type", "UMAP - Raw features\nBioacoustic, DCT, MFCC")
umap_decomposed_typesil_type <- distrib_silhouette_df_type %>% get_sil_plot()
sil_typeWe call the function get_umap_and_silhouette() to
collect results for individual, once again with all the
predictors contained in our three main sets.
target_columns <- c("id", "individual", "type", "sequence", Bioacoustic_set, DCT_set, MFCC_set) %>% unique()
df_umap <- df %>%
select(one_of(target_columns)) %>%
mutate(algo = "Raw features", set = "Bioacoustic, DCT, MFCC")
results_individual <- df_umap %>% get_umap_and_silhouette("individual", "Raw features", "Bioacoustic, DCT, MFCC", n = 100)We concatenate the elements of the list of tables silhouette scores into a single table.
distrib_silhouette_df_individual <- do.call(rbind, results_individual$sil_list)cols = c("#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64", "#599861", "#89C5DA", "#DA5724", "#74D944", "#CE50CA")
umap_individual <- results_individual$umap_list[[1]] %>%
get_umap_plot("individual", "UMAP - Raw features\nBioacoustic, DCT, MFCC") +
scale_color_manual(values = cols)
umap_individualresults_individual$umap_list[[1]] %>% plot_ly(x = ~X1, y = ~X2, color = ~individual, colors = c("#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64", "#599861", "#89C5DA", "#DA5724", "#74D944", "#CE50CA"), type = 'scatter', mode = 'markers',
hoverinfo = "text",
hovertext = paste("Type:", results_individual$umap_list[[1]]$type,
"<br>Individual:", results_individual$umap_list[[1]]$individual,
"<br>Sequence:", results_individual$umap_list[[1]]$sequence)) %>%
layout(
plot_bgcolor = "#ffffff",
legend=list(title=list(text='Individual')),
xaxis = list(
title = "0"),
yaxis = list(
title = "1")) umap_decomposed_individual <- results_individual$umap_list[[1]] %>%
get_umap_plot_decomposed("individual", "UMAP - Raw features\nBioacoustic, DCT, MFCC") +
scale_color_manual(values = cols)
umap_decomposed_individualsil_individual <- distrib_silhouette_df_individual %>%
get_sil_plot() +
scale_color_manual(values = cols)
sil_individualgrid.arrange(
umap_type, umap_individual,
sil_type, sil_individual,
nrow = 2, ncol = 2,
heights = c(0.5, 1)
)R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64
(64-bit)
OS version: Windows 10 x64 (build 22621)
parallel: Support for Parallel computation in R
version 4.1.3stats: The R Stats Package
version 4.1.3graphics: The R Graphics
Package version 4.1.3grDevices: The R
Graphics Devices and Support for Colours and Fonts version
4.1.3utils: The R Utils Package version
4.1.3datasets: The R Datasets Package
version 4.1.3methods: Formal Methods and
Classes version 4.1.3base: The R Base
Package version 4.1.3splitstackshape: Stack
and Reshape Datasets After Splitting Concatenated Values version
1.4.8plotly: Create Interactive Web Graphics via
‘plotly.js’ version 4.10.1clues: Clustering
Method Based on Local version 0.6.2.2uwot:
The Uniform Manifold Approximation and Projection (UMAP) Method for
Dimensionality Reduction version 0.1.14Matrix:
Sparse and Dense Matrix Classes and Methods version
1.4-1gridExtra: Miscellaneous Functions for
“Grid” Graphics version 2.3forcats: Tools
for Working with Categorical Variables (Factors) version
0.5.2stringr: Simple, Consistent Wrappers for
Common String Operations version 1.4.1purrr:
Functional Programming Tools version
0.3.4readr: Read Rectangular Text Data
version 2.1.3tidyr: Tidy Messy Data version
1.2.1tibble: Simple Data Frames version
3.1.8ggplot2: Create Elegant Data Visualisations
Using the Grammar of Graphics version
3.4.0tidyverse: Easily Install and Load the
‘Tidyverse’ version 1.3.2dplyr: A Grammar of
Data Manipulation version 1.0.10knitr: A
General-Purpose Package for Dynamic Report Generation in R version
1.41